Univariate Plots Section

Dataset Summary (continued below)
fixed.acidity    volatile.acidity   citric.acid      residual.sugar
Min.   : 4.60    Min.   :0.1200     Min.   :0.000    Min.   : 0.900
1st Qu.: 7.10    1st Qu.:0.3900     1st Qu.:0.090    1st Qu.: 1.900
Median : 7.90    Median :0.5200     Median :0.260    Median : 2.200
Mean   : 8.32    Mean   :0.5278     Mean   :0.271    Mean   : 2.539
3rd Qu.: 9.20    3rd Qu.:0.6400     3rd Qu.:0.420    3rd Qu.: 2.600
Max.   :15.90    Max.   :1.5800     Max.   :1.000    Max.   :15.500

Table continues below

chlorides         free.sulfur.dioxide   total.sulfur.dioxide   density
Min.   :0.01200   Min.   : 1.00         Min.   :  6.00         Min.   :0.9901
1st Qu.:0.07000   1st Qu.: 7.00         1st Qu.: 22.00         1st Qu.:0.9956
Median :0.07900   Median :14.00         Median : 38.00         Median :0.9968
Mean   :0.08747   Mean   :15.87         Mean   : 46.47         Mean   :0.9967
3rd Qu.:0.09000   3rd Qu.:21.00         3rd Qu.: 62.00         3rd Qu.:0.9978
Max.   :0.61100   Max.   :72.00         Max.   :289.00         Max.   :1.0037

pH               sulphates          alcohol          quality   rating
Min.   :2.740    Min.   :0.3300     Min.   : 8.40    3: 10     bad    :  63
1st Qu.:3.210    1st Qu.:0.5500     1st Qu.: 9.50    4: 53     average:1319
Median :3.310    Median :0.6200     Median :10.20    5:681     good   : 217
Mean   :3.311    Mean   :0.6581     Mean   :10.42    6:638
3rd Qu.:3.400    3rd Qu.:0.7300     3rd Qu.:11.10    7:199
Max.   :4.010    Max.   :2.0000     Max.   :14.90    8: 18

First, I will begin by exploring each variable to get a better sense of the data and how we can use it.

Fixed Acidity

The distribution of fixed acidity looks close to normal with a slight positive skew. It has some outliers around an acidity of 16.

Volatile Acidity

The distribution seems slightly bimodal, with peaks around 0.4 and 0.6. There are some extreme outliers above 0.8, with a few as high as 1.5.

Citric Acid

For citric acid, the highest density is at zero, followed by spikes around 0.02, 0.2 and 0.45. These spikes could reflect popular citric acid levels that most people prefer in their wine.

Residual Sugar

Most of the density is around a value of 2, with outliers extending from about 4 up to 15.
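As a rough check on where the outliers start, the quartiles from the summary table can be plugged into the usual 1.5 x IQR rule. This is a minimal sketch, assuming the Tukey fence is an acceptable definition of an outlier here; the report itself does not say which definition the plots used.

```r
# Quartiles of residual.sugar, copied from the dataset summary above.
q1 <- 1.900
q3 <- 2.600

# Tukey fences: points beyond 1.5 * IQR from the quartiles are flagged.
# (The 1.5 multiplier is the common convention, assumed here.)
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

print(c(lower = lower_fence, upper = upper_fence))
```

The upper fence lands a little below 4, which is in line with the outliers seen in the plot.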

Free Sulfur Dioxide

The data looks positively skewed, with a high concentration around a value of 6.

Total Sulfur Dioxide

The data here also looks positively skewed, with some extreme outliers near 300.

Density

The data here looks normally distributed.
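The skew remarks in this section can be cross-checked against the summary table: a mean noticeably above the median hints at a right tail. The sketch below uses the means and medians reported above; the 1% cutoff is an arbitrary assumption for illustration, not a formal test.

```r
# Means and medians copied from the dataset summary above.
stats <- data.frame(
  variable = c("fixed.acidity", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density"),
  mean     = c(8.32, 15.87, 46.47, 0.9967),
  median   = c(7.90, 14.00, 38.00, 0.9968)
)

# Relative gap between mean and median; positive values point to a right tail.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Arbitrary illustrative cutoff (an assumption): flag gaps larger than 1%.
stats$right_skew_hint <- stats$rel_gap > 0.01

print(stats)
```

The three variables described as positively skewed are flagged, while density, which looks normal, is not.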

pH Values

The pH values seem to be normally distributed, with some far-out outliers at both ends of the observed range (2.74 to 4.01 in the summary table), most visibly near 4.

Sulphates

The data here looks normal, with a mean around 0.7 and outliers extending all the way to 2. This could be due to special wines that cater to niche tastes.

Alcohol

Alcohol seems to have a similar distribution to sulphur. There could be a relationship there that is worth exploring.

Chlorides

Chlorides are concentrated around 0.08, where the majority of observations fall. There are some extreme outliers.

Rating

The majority of the data is centered on average quality. This imbalance could make models that predict quality less reliable.

The rating ranges are determined as follows:
- Bad: quality < 5
- Average: 5 <= quality < 7
- Good: 7 <= quality < 10
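One way to build such a rating variable is with `cut()`. This is only a sketch of the binning rule stated above, not necessarily how the rating column was actually created; the `quality` vector is a hypothetical stand-in for the dataset column.

```r
# Hypothetical quality scores standing in for the dataset's quality column.
quality <- c(3, 4, 5, 6, 7, 8)

# right = FALSE makes each interval closed on the left and open on the right,
# i.e. [-Inf, 5), [5, 7), [7, 10), matching the ranges listed above.
rating <- cut(
  quality,
  breaks = c(-Inf, 5, 7, 10),
  labels = c("bad", "average", "good"),
  right  = FALSE
)

print(table(rating))
```

With these breaks a score of exactly 10 would fall outside the last interval; that does not matter here because the highest score in the data is 8.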

Univariate Analysis

What is the structure of your dataset?

In the dataset, there are 1599 observations of wine. It contains 12 features, of which only quality is categorical. The other variables are numerical and describe the properties of the wine.

Other observations: Most wines are of average quality, scoring 5 or 6. This makes the dataset unbalanced, with far less data on more refined wine than on average wine. This is understandable, since fine wine is supposed to be rarer than average wine, but it leaves us less data for drawing conclusions and making predictions about fine or even bad wines.
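To put a number on the imbalance, the rating counts from the summary table can be turned into percentages. A minimal sketch, using only the counts reported above:

```r
# Rating counts copied from the dataset summary above.
rating_counts <- c(bad = 63, average = 1319, good = 217)

# Share of each rating class as a percentage of all observations.
share <- round(100 * rating_counts / sum(rating_counts), 1)

print(share)
```

Roughly four out of five wines fall in the average class, so a model that always predicted "average" would already look accurate while telling us nothing about good or bad wine.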

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is quality. We could try to determine what constitutes the quality of a wine based on its physical measurements.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

There are variables that shape taste, such as citric acid, pH and residual sugar. I am interested in seeing which of these taste factors affect the quality of the wine. There is also the alcohol level of the wine and how it might affect quality.

Did you create any new variables from existing variables in the dataset?

I created a rating variable that groups quality into categories to improve visualization.

Of the features you investigated, were there any unusual distributions? Did

There was no unusual data in the dataset.

Bivariate Plots Section

Correlation Matrix

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.26        0.67
## volatile.acidity             -0.26             1.00       -0.55
## citric.acid                   0.67            -0.55        1.00
## residual.sugar                0.11             0.00        0.14
## chlorides                     0.09             0.06        0.20
## free.sulfur.dioxide          -0.15            -0.01       -0.06
## total.sulfur.dioxide         -0.11             0.08        0.04
## density                       0.67             0.02        0.36
## pH                           -0.68             0.23       -0.54
## sulphates                     0.18            -0.26        0.31
## alcohol                      -0.06            -0.20        0.11
## quality                       0.12            -0.39        0.23
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
##                      quality
## fixed.acidity           0.12
## volatile.acidity       -0.39
## citric.acid             0.23
## residual.sugar          0.01
## chlorides              -0.13
## free.sulfur.dioxide    -0.05
## total.sulfur.dioxide   -0.19
## density                -0.17
## pH                     -0.06
## sulphates               0.25
## alcohol                 0.48
## quality                 1.00
## 
## n= 1599 
## 
## 
## P
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                      0.0000           0.0000     
## volatile.acidity     0.0000                         0.0000     
## citric.acid          0.0000        0.0000                      
## residual.sugar       0.0000        0.9389           0.0000     
## chlorides            0.0002        0.0142           0.0000     
## free.sulfur.dioxide  0.0000        0.6747           0.0147     
## total.sulfur.dioxide 0.0000        0.0022           0.1555     
## density              0.0000        0.3788           0.0000     
## pH                   0.0000        0.0000           0.0000     
## sulphates            0.0000        0.0000           0.0000     
## alcohol              0.0136        0.0000           0.0000     
## quality              0.0000        0.0000           0.0000     
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity        0.0000         0.0002    0.0000             
## volatile.acidity     0.9389         0.0142    0.6747             
## citric.acid          0.0000         0.0000    0.0147             
## residual.sugar                      0.0262    0.0000             
## chlorides            0.0262                   0.8241             
## free.sulfur.dioxide  0.0000         0.8241                       
## total.sulfur.dioxide 0.0000         0.0581    0.0000             
## density              0.0000         0.0000    0.3805             
## pH                   0.0006         0.0000    0.0049             
## sulphates            0.8252         0.0000    0.0389             
## alcohol              0.0926         0.0000    0.0055             
## quality              0.5832         0.0000    0.0428             
##                      total.sulfur.dioxide density pH     sulphates alcohol
## fixed.acidity        0.0000               0.0000  0.0000 0.0000    0.0136 
## volatile.acidity     0.0022               0.3788  0.0000 0.0000    0.0000 
## citric.acid          0.1555               0.0000  0.0000 0.0000    0.0000 
## residual.sugar       0.0000               0.0000  0.0006 0.8252    0.0926 
## chlorides            0.0581               0.0000  0.0000 0.0000    0.0000 
## free.sulfur.dioxide  0.0000               0.3805  0.0049 0.0389    0.0055 
## total.sulfur.dioxide                      0.0044  0.0078 0.0860    0.0000 
## density              0.0044                       0.0000 0.0000    0.0000 
## pH                   0.0078               0.0000         0.0000    0.0000 
## sulphates            0.0860               0.0000  0.0000           0.0002 
## alcohol              0.0000               0.0000  0.0000 0.0002           
## quality              0.0000               0.0000  0.0210 0.0000    0.0000 
##                      quality
## fixed.acidity        0.0000 
## volatile.acidity     0.0000 
## citric.acid          0.0000 
## residual.sugar       0.5832 
## chlorides            0.0000 
## free.sulfur.dioxide  0.0428 
## total.sulfur.dioxide 0.0000 
## density              0.0000 
## pH                   0.0210 
## sulphates            0.0000 
## alcohol              0.0000 
## quality

We can see several correlated pairs of variables that would be interesting to explore further, such as alcohol level and rating. Fixed acidity and density also show a strong correlation (0.67). Some correlations are to be expected, like quality and rating or free sulfur dioxide and total sulfur dioxide.

The variables that correlate most strongly with quality are:
- Alcohol (0.48)
- Sulphates (0.25)
- Citric Acid (0.23)
- Volatile Acidity (-0.39)

We will explore those variables with quality.

It is interesting to note that residual sugar has very little correlation with quality (0.01).
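For readers who want to reproduce a table like the one above, base R's `cor()` and `cor.test()` give the coefficients and p-values. The sketch below runs on a small synthetic data frame (every value is made up, and the column names are stand-ins), so its numbers will not match the report; it only illustrates the mechanics.

```r
set.seed(1)

# Synthetic stand-in for the wine data: all values are invented for illustration.
n <- 200
alcohol        <- rnorm(n, mean = 10.4, sd = 1)
quality        <- round(5.6 + 0.4 * (alcohol - 10.4) + rnorm(n, sd = 0.7))
residual_sugar <- rnorm(n, mean = 2.5, sd = 1)
toy <- data.frame(alcohol, quality, residual_sugar)

# Pearson correlation matrix, rounded to two decimals like the table above.
r <- round(cor(toy, method = "pearson"), 2)

# p-value for a single pair; cor.test() handles one pair at a time.
p_alcohol <- cor.test(toy$alcohol, toy$quality)$p.value

print(r)
print(p_alcohol)
```

With 1599 observations even small coefficients come out statistically significant, so the size of the coefficient is more informative than the p-value when choosing variables to explore.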

Quality to Alcohol

This is a very interesting correlation. Most high-quality wines seem to have high alcohol levels. Average wine (where most of our data lies) has some outliers in alcohol level. Because of those outliers, we cannot rely on alcohol alone to predict quality.

Quality to Sulphates

Here we see a small upward trend between sulphates and quality. However, even more than with alcohol, there are a large number of outliers that will throw off any predictions.

Quality to Citric Acid

There is a clear positive trend between citric acid and quality. The vast majority of low-quality wines had very little citric acid, while highly rated wines had more. However, there are outliers with a high rating but almost no citric acid. My guess is that they are special wines catering to niche tastes.

Quality to Volatile Acidity

Volatile acidity has a negative correlation with quality (-0.39), the strongest negative one in the matrix. While there are some outliers, this seems like a good predictive variable for quality.

Acidity with Density

This strong linear correlation (0.67) is worth noting. Fixed acidity seems to be a very strong factor in wine density.

Citric Acid with pH level

This relationship is to be expected: the more citric acid a wine contains, the more acidic it is and the lower its pH.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the

Alcohol seems to be one of the best predictors, even though it suffers from outliers that can throw off the model; most highly rated wines have high alcohol levels. Sulphates have a positive correlation with quality but suffer from a large number of outliers that could throw off the model. Citric acid has a clear positive trend with quality and seems like a strong predictor, although there are some outliers. Volatile acidity has a negative correlation with wine quality. It does suffer from outliers, but the trend is quite obvious.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Fixed acidity and density behaved as expected, with a linear relationship. pH and citric acid also behaved linearly, as expected.

What was the strongest relationship you found?

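The relationship between total.sulfur.dioxide and free.sulfur.dioxide (0.67) is among the strongest, which is expected since free sulfur dioxide is part of the total. Fixed acidity and pH are just as strongly related in the opposite direction (-0.68).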
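To double-check which pair is strongest, one option is to scan the lower triangle of the correlation matrix for the largest absolute value. The sketch below does this for four variables whose coefficients are copied from the matrix above; the subset is chosen only for illustration.

```r
# Coefficients copied from the correlation matrix above (subset of four variables).
vars <- c("fixed.acidity", "citric.acid", "density", "pH")
r <- matrix(c(
   1.00,  0.67,  0.67, -0.68,
   0.67,  1.00,  0.36, -0.54,
   0.67,  0.36,  1.00, -0.34,
  -0.68, -0.54, -0.34,  1.00
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Zero out the diagonal and upper triangle so each pair is counted once.
abs_r <- abs(r)
abs_r[upper.tri(abs_r, diag = TRUE)] <- 0

# Locate the largest absolute correlation among the remaining entries.
idx <- which(abs_r == max(abs_r), arr.ind = TRUE)
strongest <- c(rownames(r)[idx[1, "row"]], colnames(r)[idx[1, "col"]])

print(strongest)
```

On this subset the pair returned is pH and fixed.acidity. Running the same scan on the full matrix would also cover the sulfur dioxide pair.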

Multivariate Plots Section

In this section we will explore how pairs of variables relate to quality at the same time.

Alcohol and other variables

Since alcohol seems to be one of our strongest relationships with quality, it is worth exploring how it correlates with other variables.

Alcohol and Chlorides

There seems to be a small correlation between chlorides and quality when alcohol is held constant. What I find surprising is the spike in chloride values at low alcohol levels. I wonder if it is somehow related to the fermentation process.

Alcohol and Total Sulfur Dioxide

There seems to be little correlation between total sulfur dioxide and either alcohol or quality.

Alcohol and Density

There does not seem to be correlation between density and quality.

Alcohol and Citric Acid

There is almost no correlation between citric acid and Alcohol. However there is a trend between citric acid and rating that further confirms our previous suspicion.

Acidity exploration

Since citric acid seems like a strong variable for predicting the quality of wine, it's worth exploring how it correlates with other variables and quality.

Citric Acid and Volatile Acidity

There does not seem to be any relationship between volatile acidity and rating here, but volatile acidity shows a downward trend as citric acid increases.

Alcohol and sulfates

Very little relationship spotted here.

Linear model

Now that we have explored the variables, we will attempt to build a prediction model for the quality of the wine.
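Before fitting, one detail is worth checking. The summary table lists quality as counts per level, which suggests it is stored as a factor. If so, `as.numeric()` on it returns the position of each level rather than the score itself. A small sketch with a hypothetical factor shows the difference:

```r
# Hypothetical factor standing in for the quality column.
quality <- factor(c(3, 5, 8, 5, 6))

# as.numeric() on a factor returns the internal level codes (1, 2, 3, ...).
codes <- as.numeric(quality)

# Going through as.character() first recovers the original scores.
scores <- as.numeric(as.character(quality))

print(codes)
print(scores)
```

When all six levels from 3 to 8 are present, the codes are simply the score minus 2, so slopes and R-squared are unaffected and only the intercept shifts. Predictions would still need 2 added back to be read on the original quality scale.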

## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + citric.acid, data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + citric.acid + sulphates, 
##     data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + citric.acid + sulphates + 
##     volatile.acidity, data = training_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + citric.acid + sulphates + 
##     chlorides, data = training_data)
## 
## ==========================================================================================
##                          m1            m2            m3            m4            m5       
## ------------------------------------------------------------------------------------------
##   (Intercept)          -0.192        -0.186        -0.612**       0.679**      -0.279     
##                        (0.210)       (0.206)       (0.212)       (0.237)       (0.220)    
##   alcohol               0.368***      0.348***      0.341***      0.312***      0.314***  
##                        (0.020)       (0.020)       (0.020)       (0.019)       (0.020)    
##   citric.acid                         0.737***      0.508***     -0.131         0.574***  
##                                      (0.107)       (0.110)       (0.122)       (0.110)    
##   sulphates                                         0.847***      0.691***      1.071***  
##                                                    (0.126)       (0.122)       (0.133)    
##   volatile.acidity                                               -1.346***                
##                                                                  (0.130)                  
##   chlorides                                                                    -2.498***  
##                                                                                (0.502)    
## ------------------------------------------------------------------------------------------
##   R-squared             0.231         0.263         0.291         0.354         0.307     
##   adj. R-squared        0.230         0.261         0.289         0.351         0.304     
##   sigma                 0.705         0.691         0.678         0.647         0.671     
##   F                   335.604       198.731       152.640       152.474       123.096     
##   p                     0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -1195.976     -1172.512     -1150.450     -1098.668     -1138.169     
##   Deviance            555.514       532.699       512.103       466.834       500.984     
##   AIC                2397.951      2353.024      2310.901      2209.336      2288.339     
##   BIC                2413.012      2373.104      2336.002      2239.457      2318.460     
##   N                  1119          1119          1119          1119          1119         
## ==========================================================================================

Model 4 has the highest R-squared (0.354) and adjusted R-squared (0.351) as well as the lowest AIC and BIC, so it seems our best option.
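The general workflow behind a comparison like this can be sketched as follows. The data here is synthetic (every value is invented), the 70/30 split is an assumption suggested by N = 1119 out of 1599, and only two nested models are fitted, so this is an illustration of the approach rather than a reproduction of the table.

```r
set.seed(42)

# Synthetic stand-in for the wine data; column names mirror the report,
# but every value is made up for illustration.
n <- 300
toy <- data.frame(
  alcohol          = rnorm(n, 10.4, 1.0),
  volatile.acidity = rnorm(n, 0.53, 0.18)
)
toy$quality <- 5.6 + 0.3 * (toy$alcohol - 10.4) -
  1.3 * (toy$volatile.acidity - 0.53) + rnorm(n, sd = 0.65)

# Hold out 30% of the rows for testing (the ratio is an assumption).
train_idx     <- sample(seq_len(n), size = floor(0.7 * n))
training_data <- toy[train_idx, ]
test_data     <- toy[-train_idx, ]

# Two nested linear models fitted on the training rows only.
m1 <- lm(quality ~ alcohol, data = training_data)
m2 <- lm(quality ~ alcohol + volatile.acidity, data = training_data)

# Side-by-side comparison: higher adjusted R-squared and lower AIC are better.
comparison <- data.frame(
  model  = c("m1", "m2"),
  adj_r2 = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  aic    = c(AIC(m1), AIC(m2))
)
print(comparison)

# Root mean squared error of the larger model on the held-out rows.
rmse <- sqrt(mean((test_data$quality - predict(m2, newdata = test_data))^2))
print(rmse)
```

Checking the error on held-out rows guards against picking a model only because it fits the training data well.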

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

High alcohol content and high citric acid seem to be important for good wine.

Were there any interesting or surprising interactions between features?

There seems to be a correlation between chlorides and alcohol. I wonder if it is a byproduct of the fermentation process.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

The predictive power of this model is not very strong. Outliers are part of the reason, but I suspect the main cause of the poor predictability is how little data we have on poor or excellent wine. Most of the data is about average wine.

Final Plots and Summary

Here we revisit some of the interesting plots.

Quality to Alcohol

While most fine wine has a higher concentration of alcohol, as is evident from the graph, the distinction between average and bad wine in terms of alcohol concentration is not as pronounced. A high concentration of alcohol appears to be important for fine wine, but considering how little data we have on fine wine, this could just be sampling noise.

Alcohol and Chlorides

This correlation is interesting because of the spike in chloride concentration at the low end of the alcohol range. This could be because of the way that alcohol level is achieved. I can't see a clear relation between chlorides and quality, however.

Linear model Error

The linear model explains only around 35% of the variance in quality, based on its R-squared value of 0.354. This is quite a poor model to depend on for quality prediction, likely because of the lack of data on fine or poor wine.
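The R-squared figure can be tied back to the other rows of the model table, since R-squared is one minus the residual sum of squares (the Deviance row) divided by the total sum of squares. The sketch below recovers the total from model 1 and recomputes the rest; it uses only numbers printed in the table, so small rounding differences are expected.

```r
# Residual sums of squares (the Deviance row) copied from the model table.
deviance <- c(m1 = 555.514, m2 = 532.699, m3 = 512.103,
              m4 = 466.834, m5 = 500.984)

# Back out the total sum of squares from model 1's reported R-squared.
r2_m1 <- 0.231
tss   <- deviance[["m1"]] / (1 - r2_m1)

# R-squared for every model: share of total variation not left in the residuals.
r2 <- round(1 - deviance / tss, 3)
print(r2)

# Residual standard error for model 4: N = 1119 rows, 5 estimated coefficients.
sigma_m4 <- sqrt(deviance[["m4"]] / (1119 - 5))
print(round(sigma_m4, 3))
```

The recomputed values agree with the R-squared and sigma rows of the table to within rounding. A sigma of about 0.65 means typical predictions are off by roughly two thirds of a quality point.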

Reflection

The wine dataset contains physical as well as sensory properties of the wine. Chemical properties of the wine are recorded along with an overall rating. We chose quality as the primary focus of our analysis.

I started by examining the individual properties in the dataset. Most were either normally distributed or long-tailed. However, what stood out immediately was how much of the data was on average-quality wine. This cast a shadow over the analysis, as it lowers confidence in the results. Due to the lack of data on fine and poor wine, there was very little to go on for predicting trends.

After running a Pearson correlation test, it was apparent that these variables correlated the most with quality:
- Alcohol
- Sulphates
- Citric Acid
- Volatile Acidity

When exploring these variables further against quality, two of them, alcohol and citric acid, showed a stronger relationship with quality. Both showed strong outliers, however, further lowering my expectations for a strong prediction model. It was interesting to note that many observations had zero citric acid. That is expected, since citric acid is sometimes added to wine for taste.

In the multivariate analysis, nothing stood out except for alcohol versus chlorides. At the low end of the alcohol range, there is a spike in chloride levels. While it shows little correlation with quality, that spike is either related somehow to the fermentation process used to reach that alcohol level or an anomaly in the data.

I proceeded to build a linear regression model even though my confidence in the predictability provided by the data was low. As expected, the model could only explain 35% of the variance.

The dataset sadly is not rich enough for strong analysis. It needs many more observations of fine wine and poor wine. Also, the quality rating is a subjective value, and it is not obvious how it was obtained or documented.